feat(rivetkit): expose low-cardinality metrics routes by NathanFlurry · Pull Request #5019 · rivet-dev/rivet

NathanFlurry · 2026-05-11T13:34:47Z

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

My code follows the style guidelines of this project
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

claude · 2026-05-11T13:44:42Z

PR #5019 Review: feat(rivetkit): expose low-cardinality metrics routes

Overview

This PR makes two interconnected changes:

Metrics refactoring — switches actor metrics from high-cardinality labels (actor_id_gen, actor_key, envoy_key) to a single low-cardinality actor_name label, removes the RetainedActorMetrics cleanup machinery, renames metric namespaces, and drops several per-actor gauge metrics.
New public route API — replaces the diagnostics() method with three separate HTTP-passthrough routes (health, metadata, metrics) exposed via Registry.routes in TypeScript and via new #[napi] methods in the Rust bridge.

Security Concerns

/metrics endpoint no longer requires auth (breaking security invariant)

In registry/http.rs, the auth check was deleted entirely:

// Before: checked RIVETKIT_METRICS_ENABLED + RIVETKIT_METRICS_TOKEN bearer token
// After:
let metrics = crate::metrics_endpoint::render_prometheus_metrics()?;
bytes_response(StatusCode::OK, &metrics.content_type, metrics.body)

The RIVETKIT_METRICS_ENABLED / RIVETKIT_METRICS_TOKEN env-var gate existed so operators could enable scraping without accidentally exposing metrics to any caller who can reach the actor HTTP port. Removing it makes metrics available to everyone.

Similarly, the new Registry.routes.prometheusMetrics(request?: Request) accepts a request but ignores it — the _request underscore name suggests bearer-token auth was planned but not wired up. There is currently no auth layer on the externally-facing metrics route.

Breaking Changes

Metric renames will break existing dashboards and alerts

Registry-level "rivet" prefix removed: old names were rivet_{name}; new names embed rivetkit_ or envoy_ directly, so every metric name changes.
Specific renames: pegboard_envoy_* → envoy_*, actor_* → rivetkit_actor_*, actor_kv_* → pegboard_actor_kv_*, SQLite commit envoy timings renamed.
The SQLITE_METRICS.md doc is updated, but any Grafana dashboards or Prometheus alert rules using old names will silently stop reporting.

Code Quality

No-op methods are dead code, not backwards-compatibility shims

Several methods are converted to empty no-ops:

pub(crate) fn set_queue_depth(&self, _depth: u32) {}
pub(crate) fn set_active_connections(&self, _count: usize) {}
pub(crate) fn set_lifecycle_inbox_depth(&self, _depth: usize) {}
// etc.

Per CLAUDE.md, backwards-compatibility hacks like this should be avoided. If the metric is intentionally dropped, the callers should be updated to remove the call sites too. Keeping empty stubs is misleading — it looks like metrics are tracked when they aren't.

Default for ActorMetrics uses an empty actor_name

impl Default for ActorMetrics {
    fn default() -> Self {
        Self::new("")  // increments actor_active_count{actor_name=""}
    }
}

This creates a live Prometheus label entry with actor_name="" that increments and decrements on drop. Any code path hitting Default::default() produces a confusing actor_name="" entry in rendered metrics.

serve_config_from_js serverless_cache_envoy parameter is always true

Both call sites pass true. Consider hardcoding serverless_cache_envoy: true inside the function rather than exposing it as a parameter.

CoreEnvoyHandle wrapper is a thin newtype

CoreEnvoyHandle wraps EnvoyHandle with only a status() method returning CoreEnvoyStatus. The extra wrapper type adds indirection without meaningful encapsulation; EnvoyHandle could expose these methods directly.

candidates[0]! in #activeConfiguredRegistry

The old code iterated all candidates; the new code silently takes only the first. A comment explaining the single-candidate assumption would help.

Observability Concerns

Removed per-actor granularity is intentional but irreversible

Replacing instance-level labels with just actor_name means all metrics are aggregated by actor type. This is the stated goal (low-cardinality) but removes the ability to debug a single misbehaving actor instance via Prometheus. Ensure the inspector endpoint still covers per-instance diagnostics and this trade-off is documented.

actor_active_count retention behavior changed

The old actor_active gauge was retained for 10 minutes after actor shutdown. The new actor_active_count decrements immediately on Drop. Prometheus may now see the counter drop to 0 between scrape intervals rather than decaying slowly.

Serverless no-envoy health is optimistic

None => health_response(200, "ok", &version),

When serverless mode has no envoy, health returns 200 ok. The equivalent non-serverless case returns 503 starting. The asymmetry may be intentional but should be documented.

Test Coverage

Good additions: actor_startup_duration_metrics_render and actor_active_count_tracks_metric_lifetime.
Missing: no tests for the new health(), metadata(), or metrics() NAPI methods.
Missing: no test verifying that no-op methods don't accidentally produce metric entries.

Minor

version: "wasm" is hardcoded in WasmCoreRuntime.registryHealth() instead of using a real version constant.
route_package_version fallback returns env!("CARGO_PKG_VERSION") (Rust crate version) before the registry starts — health requests before startup return the wrong version string.
Indentation regression in index.d.ts: */ closing comment dropped one space.

Summary

The low-cardinality direction is the right call for production metrics systems. The main blockers are the removed auth on the runtime HTTP handler and the missing auth on the new public prometheusMetrics route. The no-op dead code and metric rename breaks are worth cleaning up before merging.

Generated by claude[bot]

NathanFlurry · 2026-05-11T13:46:00Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

This was referenced May 11, 2026

WIP: counter latency monitor #5011

Draft

feat(rivetkit-core): expose metrics endpoint #5013

Draft

This was referenced May 11, 2026

fix(sqlite): fence remote requests by generation #5014

Draft

fix(runtime): avoid bounded dispatch channel stalls #5015

Draft

fix(rivetkit): expose client actor metadata #5016

Draft

feat(kitchen-sink): add load testing harness #5017

Draft

This was referenced May 11, 2026

docs(serverless): document timeout tuning #5018

Draft

feat(rivetkit): add current actor metrics #5020

Draft

fix(rivetkit): require engine ping for health #5022

Draft

abcxff mentioned this pull request May 11, 2026

temp: patch wasm-pack to point to latest binary #5037

Draft

11 tasks

MasterPtato mentioned this pull request May 12, 2026

fix(envoy-client): dont abort drain after 20s #5043

Draft

11 tasks

feat(rivetkit): expose low-cardinality metrics routes

72886b4

MasterPtato force-pushed the counter-latency/serverless-docs branch from cb4ffa1 to e646699 Compare May 14, 2026 17:26

MasterPtato force-pushed the counter-latency/rivetkit-metrics-routes branch from 3a9c1be to 72886b4 Compare May 14, 2026 17:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(rivetkit): expose low-cardinality metrics routes#5019

feat(rivetkit): expose low-cardinality metrics routes#5019
NathanFlurry wants to merge 1 commit into
counter-latency/serverless-docsfrom
counter-latency/rivetkit-metrics-routes

NathanFlurry commented May 11, 2026

Uh oh!

claude Bot commented May 11, 2026 •

edited

Loading

Uh oh!

NathanFlurry commented May 11, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

NathanFlurry commented May 11, 2026

Description

Type of change

How Has This Been Tested?

Checklist:

Uh oh!

claude Bot commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR #5019 Review: feat(rivetkit): expose low-cardinality metrics routes

Overview

Security Concerns

Breaking Changes

Code Quality

Observability Concerns

Test Coverage

Minor

Summary

Uh oh!

NathanFlurry commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented May 11, 2026 •

edited

Loading

NathanFlurry commented May 11, 2026 •

edited

Loading